27 research outputs found

    GPU-accelerated machine learning techniques enable QSAR modeling of large HTS data

    Quantitative structure activity relationship (QSAR) modeling using high-throughput screening (HTS) data is a powerful technique that enables the construction of predictive models. These models are used for the in silico screening of libraries of molecules for which experimental screening is expensive in both cost and time. Machine learning techniques excel in QSAR modeling, where the relationship between structure and activity is often complex and non-linear. As these HTS data sets continue to grow in the number of compounds screened, extensive feature selection and cross-validation become computationally expensive. Leveraging massively parallel architectures such as graphics processing units (GPUs) to accelerate the training algorithms for these machine learning techniques is a cost-efficient way to address this problem. In this work, several machine learning techniques are ported to OpenCL for GPU acceleration to enable construction of QSAR ensemble models using HTS data. We report computational performance numbers using several HTS data sets freely available from the PubChem database. We also report results of a case study using HTS data for a target of pharmacological and pharmaceutical relevance, cytochrome P450 3A4, for which an enrichment of 94% of the theoretical maximum is achieved.

    Shared Genetic Etiology of Autoimmune Diseases in Patients from a Biorepository Linked to De-identified Electronic Health Records

    Autoimmune diseases represent a significant medical burden, affecting 5-8% of the U.S. population. While genetics is known to play a role, studies of common autoimmune diseases are complicated by phenotype heterogeneity, limited sample sizes, and a single-disease approach. Here we performed a targeted genetic association study for cases of multiple sclerosis (MS), rheumatoid arthritis (RA), and Crohn’s disease (CD) to assess which common genetic variants contribute individually and pleiotropically to disease risk. Joint modeling and pathway analysis combining the three phenotypes were performed to identify common underlying mechanisms of risk of autoimmune conditions. European American cases of MS, RA, and CD (n=119, 53, and 129, respectively) and 1,924 controls were identified using de-identified electronic health records (EHRs) through a combination of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) billing codes, Current Procedural Terminology (CPT) codes, medication lists, and text matching. As expected, hallmark SNPs in MS, such as DQA1 rs9271366 (OR=1.91; p=0.008), replicated in the present study. Both MS and CD were associated with TIMMDC1 rs2293370 (OR=0.27, p=0.01; OR=0.25, p=0.02; respectively). Additionally, PDE2A rs3781913 was significantly associated with both CD and RA (OR=0.46, p=0.02; OR=0.32, p=0.02; respectively). Joint modeling and pathway analysis identified variants within the KEGG NOD-like receptor signaling pathway and Shigellosis pathway as being correlated with the combined autoimmune phenotype. Our study replicated previously reported genetic associations for MS and CD in a population derived from de-identified EHRs. We found evidence to support a shared genetic etiology between CD/MS and CD/RA outside of the major histocompatibility complex region and identified KEGG pathways indicative of a bacterial pathogenesis risk for autoimmunity in a joint model. Future work to elucidate this shared etiology will be key in the development of risk models as envisioned in the era of precision medicine.

    BCL::EMAS — Enantioselective Molecular Asymmetry Descriptor for 3D-QSAR

    Stereochemistry is an important determinant of a molecule’s biological activity. Stereoisomers can have different degrees of efficacy or even opposing effects when interacting with a target protein. Stereochemistry is a molecular property that is difficult to represent in 2D-QSAR because it is an inherently three-dimensional phenomenon. A major drawback of most proposed descriptors for 3D-QSAR that encode stereochemistry is that they require a heuristic for defining all stereocenters and rank-ordering their substituents. Here we propose a novel 3D-QSAR descriptor termed Enantioselective Molecular ASymmetry (EMAS) that is capable of distinguishing between enantiomers in the absence of such heuristics. The descriptor aims to measure the deviation from an overall symmetric shape of the molecule. A radial-distribution function (RDF) determines a signed volume of tetrahedrons of all triplets of atoms and the molecule center. The descriptor can be enriched with atom-centric properties such as partial charge. This descriptor showed good predictability when tested with a dataset of thirty-one steroids commonly used to benchmark stereochemistry descriptors (r<sup>2</sup> = 0.89, q<sup>2</sup> = 0.78). Additionally, EMAS improved enrichment to 4.38, versus 3.94 without EMAS, in a simulated virtual high-throughput screening (vHTS) for inhibitors and substrates of cytochrome P450 (PubChem AID 891).
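
The signed-volume idea mentioned in this abstract can be illustrated with a small sketch. This is not the EMAS implementation: the abstract does not specify how triplets are ordered or how the volumes are binned into the RDF, so the code below only shows the geometric building block, under the assumption that the "molecule center" is the unweighted centroid. The coordinates and the function name `signed_tetrahedron_volume` are hypothetical.

```python
import numpy as np


def signed_tetrahedron_volume(a, b, c, center):
    """Signed volume of the tetrahedron spanned by points a, b, c and center.

    The sign depends on the handedness of the ordered triplet (a, b, c)
    as seen from the center, so it flips under reflection.
    """
    edges = np.array([a - center, b - center, c - center], dtype=float)
    return np.linalg.det(edges) / 6.0


# Hypothetical coordinates for four atoms (not taken from any real molecule).
coords = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -0.5],
])
# Assumption: the molecule center is the unweighted centroid of the atoms.
center = coords.mean(axis=0)

# Build the mirror image by reflecting through the yz-plane (x -> -x).
mirrored = coords * np.array([-1.0, 1.0, 1.0])
mirrored_center = mirrored.mean(axis=0)

v = signed_tetrahedron_volume(coords[0], coords[1], coords[2], center)
v_mirror = signed_tetrahedron_volume(
    mirrored[0], mirrored[1], mirrored[2], mirrored_center
)

print(v, v_mirror)  # equal magnitude, opposite sign
```

Because the sign flips for the mirror image while the magnitude is unchanged, a descriptor built from such signed volumes can in principle separate enantiomers without labeling stereocenters explicitly.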

    Benchmarking Ligand-Based Virtual High-Throughput Screening with the PubChem Database

    With the rapidly increasing availability of High-Throughput Screening (HTS) data in the public domain, such as the PubChem database, methods for ligand-based computer-aided drug discovery (LB-CADD) have the potential to accelerate and reduce the cost of probe development and drug discovery efforts in academia. We assemble nine data sets from realistic HTS campaigns representing major families of drug target proteins for benchmarking LB-CADD methods. Each data set is in the public domain through PubChem and carefully collated through confirmation screens validating active compounds. These data sets provide the foundation for benchmarking a new cheminformatics framework, BCL::ChemInfo, which is freely available for non-commercial use. Quantitative structure activity relationship (QSAR) models are built using Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), and Kohonen networks (KNs). Problem-specific descriptor optimization protocols are assessed, including Sequential Feature Forward Selection (SFFS) and various information content measures. Measures of predictive power and confidence are evaluated through cross-validation, and a consensus prediction scheme is tested that combines orthogonal machine learning algorithms into a single predictor. Enrichments ranging from 15 to 101 for a TPR cutoff of 25% are observed.
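
One plausible way to compute an enrichment at a true positive rate (TPR) cutoff is sketched below. The abstract does not define the metric, so the definition used here is an assumption: rank compounds by predicted score, take the shortest top-ranked list that recovers at least the cutoff fraction of all actives, and divide the fraction of actives in that list by the fraction of actives in the whole set. The function name `enrichment_at_tpr` and the toy data are hypothetical.

```python
import numpy as np


def enrichment_at_tpr(scores, labels, tpr_cutoff=0.25):
    """Enrichment of the top-ranked list at which TPR first reaches tpr_cutoff.

    Assumed definition (not taken from the source): precision within the
    selected list divided by the overall fraction of actives.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Rank compounds from highest to lowest predicted score.
    order = np.argsort(-scores, kind="stable")
    ranked = labels[order]

    # Cumulative true positives and TPR after each rank position.
    cum_tp = np.cumsum(ranked)
    tpr = cum_tp / ranked.sum()

    # Smallest list length k whose TPR meets the cutoff.
    k = int(np.argmax(tpr >= tpr_cutoff)) + 1

    precision = cum_tp[k - 1] / k
    base_rate = ranked.mean()
    return precision / base_rate


# Toy data: 10 compounds, 2 actives (label 1), scores already sorted.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

toy_enrichment = enrichment_at_tpr(toy_scores, toy_labels)
print(toy_enrichment)
```

In the toy data the first active appears at rank 2, so the selected list has a precision of 1/2 against a base rate of 2/10.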

    Introduction to the BioChemical Library (BCL): An Application-Based Open-Source Toolkit for Integrated Cheminformatics and Machine Learning in Computer-Aided Drug Discovery

    The BioChemical Library (BCL) cheminformatics toolkit is an application-based academic open-source software package designed to integrate traditional small molecule cheminformatics tools with machine learning-based quantitative structure-activity/property relationship (QSAR/QSPR) modeling. In this pedagogical article we provide a detailed introduction to core BCL cheminformatics functionality, showing how traditional tasks (e.g., computing chemical properties, estimating druglikeness) can be readily combined with machine learning. In addition, we have included multiple examples covering areas of advanced use, such as reaction-based library design. We anticipate that this manuscript will be a valuable resource for researchers in computer-aided drug discovery looking to integrate modular cheminformatics and machine learning tools into their pipelines.

    Identification of Metabotropic Glutamate Receptor Subtype 5 Potentiators Using Virtual High-Throughput Screening

    Selective potentiators of glutamate response at metabotropic glutamate receptor subtype 5 (mGluR5) have exciting potential for the development of novel treatment strategies for schizophrenia. A total of 1,382 compounds with positive allosteric modulation (PAM) of the mGluR5 glutamate response were identified through high-throughput screening (HTS) of a diverse library of 144,475 substances utilizing a functional assay measuring receptor-induced intracellular release of calcium. Primary hits were tested for concentration-dependent activity, and potency data (EC<sub>50</sub> values) were used for training artificial neural network (ANN) quantitative structure−activity relationship (QSAR) models that predict biological potency from the chemical structure. While all models were trained to predict EC<sub>50</sub>, the quality of the models was assessed by using both continuous measures and binary classification. Numerical descriptors of chemical structure were used as input for the machine learning procedure and optimized in an iterative protocol. The ANN models achieved theoretical enrichment ratios of up to 38 for an independent data set not used in training the model. A database of ∼450,000 commercially available drug-like compounds was targeted in a virtual screen. A set of 824 compounds was obtained for testing based on the highest predicted potency values. Biological testing found that 28.2% (232/824) of these compounds had various activities at mGluR5, including 177 pure potentiators and 55 partial agonists. These results represent an enrichment factor of 23 for pure potentiation of the mGluR5 glutamate response and 30 for overall mGluR5 modulation activity when compared with those of the original mGluR5 experimental screening data (0.94% hit rate). The active compounds identified contained 72% close derivatives of previously identified PAMs as well as 28% nontrivial derivatives of known active compounds.
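
The enrichment factors quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the enrichment factor is the hit rate in the virtually screened set divided by the baseline hit rate of the original screen; it uses only the numbers stated above, and the function name `enrichment_factor` is illustrative.

```python
def enrichment_factor(hits, tested, baseline_hit_rate):
    """Hit rate in the tested set relative to a baseline hit rate."""
    return (hits / tested) / baseline_hit_rate


# Numbers stated in the abstract.
tested = 824                # compounds selected by the virtual screen
all_actives = 232           # compounds with any activity at mGluR5
pure_potentiators = 177     # subset that are pure potentiators
baseline = 0.0094           # 0.94% hit rate of the original HTS

hit_rate_percent = round(100 * all_actives / tested, 1)
overall = enrichment_factor(all_actives, tested, baseline)
pure = enrichment_factor(pure_potentiators, tested, baseline)

print(hit_rate_percent)   # hit rate of the virtual screen, in percent
print(round(overall))     # enrichment for overall modulation activity
print(round(pure))        # enrichment for pure potentiation
```

Under this definition the results round to the 28.2% hit rate and the enrichment factors of 30 and 23 reported in the abstract.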